Cooperativity boosts affinity and specificity of proteins with multiple RNA-binding domains

Abstract Numerous cellular processes rely on the binding of proteins with high affinity to specific sets of RNAs. Yet most RNA-binding domains display low specificity and affinity in comparison to DNA-binding domains. The best binding motif is typically only enriched by less than a factor 10 in high-throughput RNA SELEX or RNA bind-n-seq measurements. Here, we provide insight into how cooperative binding of multiple domains in RNA-binding proteins (RBPs) can boost their effective affinity and specificity orders of magnitude higher than their individual domains. We present a thermodynamic model to calculate the effective binding affinity (avidity) for idealized, sequence-specific RBPs with any number of RBDs given the affinities of their isolated domains. For seven proteins in which affinities for individual domains have been measured, the model predictions are in good agreement with measurements. The model also explains how a two-fold difference in binding site density on RNA can increase protein occupancy 10-fold. It is therefore rationalized that local clusters of binding motifs are the physiological binding targets of multi-domain RBPs.


Calculating avidities for n binding sites
To generalize our approach to n binding sites, we can similarly express the avidity according to the law of mass action in terms of the equilibrium concentrations of all states: The K a,(0...0,x) can be written as a product where |x| is the number of bound sites at state x and p x describes a path from the unbound state (0 . . . 0) to state x = p x (|x|) (for instance flipping unbound sites to bound sites in the order from leftmost to rightmost RNA site). We define K a,px(i−1) px(i) as the association constant of the reaction between configurations p x (i − 1) and p x (i) along the path p x (e.g. Figure 2D, main text). Because for the first factor we can write equation (4) can finally be written as It can readily be seen that is a generalization of known results with n = 2. If the individual factors K a,px(i−1) px(i) = K a,i c i−1,i are much larger than 1, which is equivalent to the local concentrations saturating the nearest unbound binding site, the last term of the sum, with the most factors, dominates This shows how each added binding site approximately multiplies the total K av by a factor K a,i c i−1,i .

Effective concentrations using the worm-like chain model
To estimate the effective concentration of a protein domain at an RNA binding site if another domain of the protein already binds the RNA, we model the RNA and the disordered peptide linker between RNA-binding domains as worm-like chains. We use estimates of the persistence length l p of RNA, which quantifies a chain's stiffness. We call L the chain length between binding sites on the RNA. We call d the distance in 3D space between binding domains on the protein. The local concentration of the second domain at the second RNA binding site can be identified with the probability density of the second RNA site at a 3D distance of d from the first, bound protein domain ( Figure  3B, main text). Given a large enough chain length between the binding sites, L l p , the radial distribution function tends towards a Gaussian distribution, with a variance of Figure S1: Protein with flexible linker between domains and its RNA target. To describe the effective concentration of the second RNA binding site at the second protein domain, we introduce R 1 , R 2 and L Protein as new parameters.

Effect of flexible peptide linker on the effective concentration
The above treatment assumes a rigid connection between protein domains. If the protein binding domains are connected by flexible linkers and we allow them to move independently, we need the additional distances R 1 , R 2 and L Protein to describe the local concentration, c eff ( Figure S1). It can be expressed by a convolution of three distributions: By integrating equation (11) over r p and defining we can write it as To integrate over the spherical shell with radius R 2 , we transform to spherical coordinates and finally arrive at 4 Procedure for calculating an avidity for "fuzzy" binding and n = 2 In analogy to the above described treatment, we write the K av for two domains, when we allow "fuzzy" binding (every domain can bind every other binding site), in terms of the concentrations of all binding configurations as K av (2) where K a1 2 and K a2 1 denote the association constant for the first domain binding to the second RNA binding site, and the second domain binding to the first RNA binding site, respectively.

Simulating cooperative binding with Gillespie stochastic simulation algorithm
The Gillespie algorithm allows us to define model parameters and reactions, and based on these obtain the time dependent concentrations of all species in the system after a simulation. To build the model we used the Python library Gillespy2. The parameters of the system are as described in the main text. Before the simulation, all possible molecular entities in the system are defined, which in our case correspond to the different binding configurations. Furthermore, we define all possible reactions between these configurations with rate constants, where the reactions are binding and unbinding events. Because only the ratio between rates determines the end state, the off-rate constant of individual domains k off is set to k off = 1 and according to K d = k off kon the respective on-rate constant is k on = 1 K d . We assume the rate constants of the unbinding events to be independent of the current binding state. The on-rate constants are calculated based on the current state as k * on = k on c eff with k on the on-rate constant for the individual domain.
c eff is expressed according to the worm-like-chain model as a Gaussian distribution. When looking at a system with more than 2 binding sites, there are going to be reactions where an unbound binding site is between two bound binding sites as seen in the reaction 101 −− −− 111. The effective concentration in this case has to be expressed according to the laws of probability as the normalized product of two Gaussians. For the normalized product of two Gaussians we get In our case we have d = d 1 , µ 1 = 0 and µ 2 = d 1 + d 2 with d 1 and d 2 the distances between binding domains on the protein.
The K d is then calculated based on the appropriate concentrations after reaching equilibrium in the simulation.

Calculating K d values for KSRP
In addition to the examples in the main text of proteins with two domains, for which the K d s of individual domains and the full-length protein were measured, here, we analyze the four domain protein KSRP, binding to a 34 nt AU-rich RNA [1]. KSRP consists of four KH domains. The second and third domain are linked as a rigid unit, while the first and fourth domain are connected by flexible linkers [2][3][4]. We predict the avidity of the full length protein and four different variants, in which mutations in individual binding domains remove their ability to bind RNA. According to the experimental setup, we make the following assumptions in addition to the ones described in the main text.
First, the RNA used in this study does not contain defined binding motifs. Rather, it consists of a stretch of 34 AU-rich nucleotides. The KH-domains do not have a particular sequence specificity for such sequences [3], which is why we assume that they can bind at any position. In line with the assumptions we describe for equation 5 in the main text, we make the approximation that RNA-protein configurations, in which all four domains are bound with minimal sequence length between bound nucleotides on the RNA dominate the population and thus the K d . This is a result of our model, in which a shorter linker length leads to higher local concentrations, so that the closest possible binding site will saturate at the expense of other binding sites. We only take these binding arrangements into account. We can calculate the number of possible binding arrangements of each protein variant and use equation 5 in the main text to calculate the avidity. The number of possible arrangements is given by The factor 2 is a result of the fact that the domains have no specificity for any motifs in this RNA and the whole protein can thus bind in a 'forward' or 'reverse' orientation. The footprint of the protein is calculated as the sum of all bound nucleotides (4 nt per protein domain) plus the sum of RNA linker lengths L ij between occupied nucleotides. RNA linkers lengths were derived from structural data [2][3][4]. This results in binding footprints between 14 nt and 22 nt (mutated sequences or full length protein, see tables below).
Second, the study was not able to produce an exact measurement for the K d of the first domain, which was reported as > 1 mm. In one case (mutated first domain), this affinity is not needed. In the other three cases, however, we make an educated guess and set K d,1 = 3 mm.
The following tables summarize the parameters we use and the results from the calculations: Even though we had to make some assumptions in addition to the main model, the results of this analysis are very self consistent and we are able to replicate all trends that are visible between the measurements of the wildtype and the mutants. These calculations thus further validate our model.